# Multimodal instruction understanding
**Mistral Small 3.2 24B Instruct 2506 GGUF** · unsloth · Apache-2.0

Mistral-Small-3.2-24B-Instruct-2506 is an image-text-to-text model, provided here as quantized GGUF builds, that shows significant improvements in instruction following, reduced repetition errors, and more reliable function calling.

Tags: Image-to-Text · Supports Multiple Languages
Downloads: 8,640 · Likes: 32
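
A minimal sketch of chatting with a quantized GGUF build via llama-cpp-python; the repo id and filename glob are assumptions to adjust to the files actually published, and image input (which needs the model's multimodal projector) is not shown.

```python
# Text-chat sketch with llama-cpp-python (pip install llama-cpp-python).
# Repo id and filename glob are placeholders; pick a real quantization from
# the repository's file list.
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="unsloth/Mistral-Small-3.2-24B-Instruct-2506-GGUF",  # assumed repo id
    filename="*Q4_K_M.gguf",  # glob matching a 4-bit quantized file
    n_ctx=8192,               # context window to allocate
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize GGUF quantization in two sentences."}]
)
print(out["choices"][0]["message"]["content"])
```
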
**Ultravox V0 5 Llama 3 1 8b** · FriendliAI · MIT

A multilingual audio-to-text model built on Llama-3.1-8B-Instruct that handles speech input in over 40 languages.

Tags: Large Language Model · Transformers · Supports Multiple Languages
Downloads: 218 · Likes: 0
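
A sketch of audio-to-text inference following the pipeline pattern documented for Ultravox checkpoints; the checkpoint id and audio path are assumptions.

```python
# Audio-to-text sketch using a Transformers pipeline with Ultravox's custom
# model code; checkpoint id and audio file are placeholders.
import librosa
import transformers

pipe = transformers.pipeline(
    model="fixie-ai/ultravox-v0_5-llama-3_1-8b",  # assumed upstream checkpoint id
    trust_remote_code=True,                       # Ultravox ships custom model code
)

audio, sr = librosa.load("question.wav", sr=16000)  # placeholder recording
turns = [{"role": "system", "content": "You are a helpful multilingual assistant."}]
print(pipe({"audio": audio, "turns": turns, "sampling_rate": sr}, max_new_tokens=64))
```
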
**Qwen2.5 VL 32B Instruct GGUF** · Mungert · Apache-2.0

Qwen2.5-VL-32B-Instruct is a 32B-parameter multimodal vision-language model that supports joint image-text understanding and text generation; this repository provides quantized GGUF builds.

Tags: Image-to-Text · English
Downloads: 9,766 · Likes: 6
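
For large GGUF repositories it is usually better to fetch a single quantization than to clone everything; a minimal sketch with huggingface_hub, where the repo id and filename are illustrative.

```python
# List available quantizations, then download one GGUF file.
from huggingface_hub import hf_hub_download, list_repo_files

repo_id = "Mungert/Qwen2.5-VL-32B-Instruct-GGUF"  # assumed repo id
for name in list_repo_files(repo_id):
    print(name)  # inspect the published quantization levels (Q4_K_M, Q8_0, ...)

path = hf_hub_download(
    repo_id=repo_id,
    filename="Qwen2.5-VL-32B-Instruct-Q4_K_M.gguf",  # placeholder filename
)
print("Saved to", path)
```
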
**Qwen2 VL 2B Instruct** · FriendliAI · Apache-2.0

Qwen2-VL-2B-Instruct is a multimodal vision-language model that supports image-text-to-text tasks.

Tags: Image-to-Text · Transformers · English
Downloads: 24 · Likes: 1
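
A minimal image-text-to-text sketch with Transformers, assuming the upstream Qwen checkpoint id and a placeholder image URL.

```python
# Describe an image with Qwen2-VL via Transformers; the checkpoint id and
# image URL are placeholders.
import requests
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-2B-Instruct"  # assumed upstream checkpoint id
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open(requests.get("https://example.com/cat.jpg", stream=True).raw)
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Describe this image in one sentence."},
]}]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=64)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```
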
**Instructclip InstructPix2Pix** · SherryXTChen · Apache-2.0

InstructCLIP is an instruction-guided image editing model improved through contrastive-learning-based automatic data optimization. It combines CLIP and Stable Diffusion to edit images based on textual instructions.

Tags: Text-to-Image · English
Downloads: 450 · Likes: 5
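
A sketch of instruction-guided editing through the diffusers InstructPix2Pix pipeline this model builds on; the checkpoint id, input image, and instruction are assumptions.

```python
# Edit an image from a text instruction; requires a CUDA GPU as written.
import torch
from PIL import Image
from diffusers import StableDiffusionInstructPix2PixPipeline

pipe = StableDiffusionInstructPix2PixPipeline.from_pretrained(
    "SherryXTChen/InstructCLIP-InstructPix2Pix",  # assumed checkpoint id
    torch_dtype=torch.float16,
).to("cuda")

source = Image.open("room.png").convert("RGB")  # placeholder input image
edited = pipe(
    "make the walls light blue",  # the textual edit instruction
    image=source,
    num_inference_steps=20,
    image_guidance_scale=1.5,     # how closely to stick to the source image
).images[0]
edited.save("room_edited.png")
```
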
**Phi 4 Multimodal Instruct** · Robeeeeeeeeeee · MIT

Phi-4-multimodal-instruct is a lightweight open-source multimodal foundation model that integrates the language, vision, and speech research and datasets behind the Phi-3.5 and Phi-4.0 models. It accepts text, image, and audio inputs, generates text outputs, and supports a context length of 128K tokens.

Tags: Multimodal Fusion · Transformers · Supports Multiple Languages
Downloads: 21 · Likes: 1
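
A sketch of image-plus-text inference through Transformers remote code, following the prompt conventions published for Phi multimodal checkpoints; the checkpoint id, tag format, and image path are assumptions.

```python
# Ask Phi-4-multimodal about an image; placeholders throughout.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Phi-4-multimodal-instruct"  # assumed upstream checkpoint id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id, trust_remote_code=True, torch_dtype="auto", device_map="auto"
)

image = Image.open("chart.png")  # placeholder image
prompt = "<|user|><|image_1|>What does this chart show?<|end|><|assistant|>"
inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```
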
**Phi 4 Multimodal Instruct Onnx** · microsoft · MIT

ONNX version of the Phi-4 multimodal model, quantized to int4 precision with accelerated inference via ONNX Runtime; it supports text, image, and audio inputs.

Tags: Multimodal Fusion · Other
Downloads: 159 · Likes: 66
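
A text-only generation sketch with onnxruntime-genai against the int4 weights; the local folder is a placeholder, the generator API has shifted between releases, and image/audio inputs (which go through the package's multimodal processor) are omitted.

```python
# Token-streaming generation loop with onnxruntime-genai; treat this as a
# pattern rather than a pinned recipe, since API details vary by release.
import onnxruntime_genai as og

model = og.Model("./phi-4-multimodal-onnx-int4")  # placeholder local folder
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)

generator = og.Generator(model, params)
generator.append_tokens(tokenizer.encode("Explain int4 quantization briefly."))
while not generator.is_done():
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```
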
**Llama 3.2 11B Vision Instruct GGUF** · pbatra

Llama-3.2-11B-Vision-Instruct is a multilingual vision-language model for image-text-to-text tasks, provided here as GGUF builds (loadable with the llama-cpp-python pattern shown above, subject to the runtime supporting this architecture).

Tags: Image-to-Text · Transformers · Supports Multiple Languages
Downloads: 172 · Likes: 1

**Taivisionlm Base V2** · benchang1110

A 1.2B-parameter vision-language model, the first to support instruction input in Traditional Chinese; it is compatible with the Transformers library, quick to load, and easy to fine-tune.

Tags: Image-to-Text · Transformers · Chinese
Downloads: 122 · Likes: 4
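
A loading sketch for a Transformers-compatible VLM that ships custom code; the repo id, model class, and prompt are assumptions, so check the model card for the exact inputs its processor expects.

```python
# Generic trust_remote_code load for a custom VLM; placeholders throughout.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "benchang1110/TaiVisionLM-base-v2"  # assumed repo id
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("photo.jpg")  # placeholder image
# Traditional Chinese instruction: "Describe this picture."
inputs = processor(text="請描述這張圖片。", images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(output_ids, skip_special_tokens=True)[0])
```
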
**Octo Small 1.5** · rail-berkeley · MIT

Octo Small is a Transformer-based diffusion policy model for robot control that predicts robot actions from visual inputs and language instructions.

Tags: Multimodal Fusion · Transformers
Downloads: 250 · Likes: 6
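
An action-sampling sketch following the pattern in the Octo README; the checkpoint path, observation keys and shapes, and task string are illustrative, and a real deployment must match the observation format the checkpoint was trained with.

```python
# Sample an action chunk from a language-conditioned Octo policy (JAX-based).
import jax
import numpy as np
from octo.model.octo_model import OctoModel

model = OctoModel.load_pretrained("hf://rail-berkeley/octo-small-1.5")  # assumed path

# Dummy single-frame camera observation: (batch=1, history=1, H, W, C).
observation = {
    "image_primary": np.zeros((1, 1, 256, 256, 3), dtype=np.uint8),
    "timestep_pad_mask": np.ones((1, 1), dtype=bool),
}
task = model.create_tasks(texts=["pick up the red block"])  # language goal
actions = model.sample_actions(observation, task, rng=jax.random.PRNGKey(0))
print(actions.shape)  # predicted (normalized) action chunk
```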